evaluation plan
Learning to Plan & Reason for Evaluation with Thinking-LLM-as-a-Judge
Saha, Swarnadeep, Li, Xian, Ghazvininejad, Marjan, Weston, Jason, Wang, Tianlu
LLM-as-a-Judge models generate chain-of-thought (CoT) sequences intended to capture the step-bystep reasoning process that underlies the final evaluation of a response. However, due to the lack of human annotated CoTs for evaluation, the required components and structure of effective reasoning traces remain understudied. Consequently, previous approaches often (1) constrain reasoning traces to hand-designed components, such as a list of criteria, reference answers, or verification questions and (2) structure them such that planning is intertwined with the reasoning for evaluation. In this work, we propose EvalPlanner, a preference optimization algorithm for Thinking-LLM-as-a-Judge that first generates an unconstrained evaluation plan, followed by its execution, and then the final judgment. In a self-training loop, EvalPlanner iteratively optimizes over synthetically constructed evaluation plans and executions, leading to better final verdicts. Our method achieves a new state-of-the-art performance for generative reward models on RewardBench (with a score of 93.9), despite being trained on fewer amount of, and synthetically generated, preference pairs. Additional experiments on other benchmarks like RM-Bench, JudgeBench, and FollowBenchEval further highlight the utility of both planning and reasoning for building robust LLM-as-a-Judge reasoning models.
"Close...but not as good as an educator." -- Using ChatGPT to provide formative feedback in large-class collaborative learning
Ponte, Cory Dal, Dushyanthen, Sathana, Lyons, Kayley
Delivering personalised, formative feedback to multiple problem-based learning groups in a short time period can be almost impossible. We employed ChatGPT to provide personalised formative feedback in a one-hour Zoom break-out room activity that taught practicing health professionals how to formulate evaluation plans for digital health initiatives. Learners completed an evaluation survey that included Likert scales and open-ended questions that were analysed. Half of the 44 survey respondents had never used ChatGPT before. Overall, respondents found the feedback favourable, described a wide range of group dynamics, and had adaptive responses to the feedback, yet only three groups used the feedback loop to improve their evaluation plans. Future educators can learn from our experience including engineering prompts, providing instructions on how to use ChatGPT, and scaffolding optimal group interactions with ChatGPT. Future researchers should explore the influence of ChatGPT on group dynamics and derive design principles for the use of ChatGPT in collaborative learning.
Branch-Solve-Merge Improves Large Language Model Evaluation and Generation
Saha, Swarnadeep, Levy, Omer, Celikyilmaz, Asli, Bansal, Mohit, Weston, Jason, Li, Xian
Large Language Models (LLMs) are frequently used for multi-faceted language generation and evaluation tasks that involve satisfying intricate user constraints or taking into account multiple aspects and criteria. However, their performance can fall short, due to the model's lack of coherence and inability to plan and decompose the problem. We propose Branch-Solve-Merge (BSM), a Large Language Model program (Schlag et al., 2023) for tackling such challenging natural language tasks. It consists of branch, solve, and merge modules that are parameterized with specific prompts to the base LLM. These three modules plan a decomposition of the task into multiple parallel sub-tasks, independently solve them, and fuse the solutions to the sub-tasks. We apply our method to the tasks of LLM response evaluation and constrained text generation and evaluate its effectiveness with multiple LLMs, including Vicuna, LLaMA-2-chat, and GPT-4. BSM improves the evaluation correctness and consistency for each LLM by enhancing human-LLM agreement by up to 26%, reducing length and pairwise position biases by up to 50%, and allowing LLaMA-2-chat to match or outperform GPT-4 on most domains. On the constraint story generation task, BSM improves the coherence of the stories while also improving constraint satisfaction by 12%.
Precomputing Datalog evaluation plans in large-scale scenarios
Fiorentino, Alessio, Leone, Nicola, Manna, Marco, Perri, Simona, Zangari, Jessica
In this scenario, to reduce memory consumption and possibly optimize execution times, the paper proposes novel techniques to determine an optimal indexing schema for the underlying database together with suitable body-orderings for the Datalog rules. The new approach is compared with the standard execution plans implemented in DL V over widely used ontological benchmarks. The results confirm that the memory usage can be significantly reduced without paying any cost in efficiency. This paper is under consideration in Theory and Practice of Logic Programming (TPLP). KEYWORDS: Datalog; Query Answering; Ontologies; Query-plan; Data Indexing 1 Introduction Ontological reasoning services represent fundamental features in the development of the Semantic Web. Among them, scientists are focusing their attention on the so-called ontology-based query answering (OBQA), where a Boolean query has to be evaluated against a logical theory (knowledge base) consisting of an extensional database paired with an ontology (Cal ı et al. 2009; Ortiz 2013; Amendola et al. 2018).